{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Creating a searchable Art Database with The MET's open-access collection</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using MMLSpark. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np, pandas as pd, os, sys, time, json, requests\n",
    "\n",
    "from mmlspark import *\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "from pyspark.sql.functions import udf, col\n",
    "from pyspark.sql.types import IntegerType, StringType, DoubleType, StructType, StructField, ArrayType\n",
    "from pyspark.ml import Transformer, Estimator, Pipeline\n",
    "from pyspark.ml.feature import SQLTransformer\n",
    "from pyspark.sql.functions import lit, udf, col, split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "VISION_API_KEY = os.environ['VISION_API_KEY']\n",
    "AZURE_SEARCH_KEY = os.environ['AZURE_SEARCH_KEY']\n",
    "search_service = \"mmlspark-azure-search\"\n",
    "search_index = \"test\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = spark.read\\\n",
    "  .format(\"csv\")\\\n",
    "  .option(\"header\", True)\\\n",
    "  .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv\")\\\n",
    "  .withColumn(\"searchAction\", lit(\"upload\"))\\\n",
    "  .withColumn(\"Neighbors\", split(col(\"Neighbors\"), \",\").cast(\"array<string>\"))\\\n",
    "  .withColumn(\"Tags\", split(col(\"Tags\"), \",\").cast(\"array<string>\"))\\\n",
    "  .limit(25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#define pipeline\n",
    "describeImage = DescribeImage()\\\n",
    "                .setSubscriptionKey(VISION_API_KEY)\\\n",
    "                .setUrl(\"https://eastus.api.cognitive.microsoft.com/vision/v2.0/describe\")\\\n",
    "                .setImageUrlCol(\"PrimaryImageUrl\")\\\n",
    "                .setOutputCol(\"RawImageDescription\")\\\n",
    "                .setConcurrency(5)\\\n",
    "                .setMaxCandidates(5)\n",
    "\n",
    "getDescription = SQLTransformer(statement=\"SELECT *, RawImageDescription.description.captions.text as ImageDescriptions \\\n",
    "                                FROM __THIS__\")\n",
    "\n",
    "getTags = SQLTransformer(statement=\"SELECT *, RawImageDescription.description.tags as ImageTags FROM __THIS__\")\n",
    "                \n",
    "finalcols = SelectColumns().setCols(['ObjectID', 'Department', 'Culture', 'Medium', 'Classification', 'PrimaryImageUrl',\\\n",
    "                                     'Tags', 'Neighbors', 'ImageDescriptions', 'searchAction'])\n",
    "\n",
    "data_processed = Pipeline(stages = [describeImage, getDescription, getTags, finalcols])\\\n",
    "                    .fit(data).transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png\" width=\"800\" style=\"float: center;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before writing the results to a Search Index, you must define a schema which must specify the name, type, and attributes of each field in your index. Refer [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "index_dict = { \"name\" : search_index, \n",
    "               \"fields\" : [\n",
    "                  {\n",
    "                       \"name\": \"ObjectID\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"key\": True, \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Department\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  }, \n",
    "                  {\n",
    "                       \"name\": \"Culture\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Medium\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"Classification\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"PrimaryImageUrl\", \n",
    "                       \"type\": \"Edm.String\", \n",
    "                       \"facetable\": False\n",
    "                  },                   \n",
    "                  {\n",
    "                       \"name\": \"Tags\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  },                  \n",
    "                  {\n",
    "                       \"name\": \"Neighbors\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  },\n",
    "                  {\n",
    "                       \"name\": \"ImageDescriptions\", \n",
    "                       \"type\": \"Collection(Edm.String)\", \n",
    "                       \"facetable\": False\n",
    "                  }\n",
    "              ]}\n",
    "\n",
    "index_str = json.dumps(index_dict)\n",
    "\n",
    "options = {\n",
    "            \"subscriptionKey\" : AZURE_SEARCH_KEY,\n",
    "            \"actionCol\" : \"searchAction\",\n",
    "            \"serviceName\" : search_service,\n",
    "            \"indexJson\" : index_str\n",
    "          }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "data_processed.writeToAzureSearch(options)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying refer [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "post_url = 'https://%s.search.windows.net/indexes/%s/docs/search?api-version=2017-11-11' % (search_service, search_index)\n",
    "\n",
    "headers = {\n",
    "    \"Content-Type\":\"application/json\",\n",
    "    \"api-key\": os.environ['AZURE_SEARCH_KEY']\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "body = {\n",
    "    \"search\": \"Glass\",\n",
    "    \"searchFields\": \"Classification\",\n",
    "    \"select\": \"ObjectID, Department, Culture, Medium, Classification, PrimaryImageUrl, Tags, ImageDescriptions\",\n",
    "    \"top\":\"3\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "response = requests.post(post_url, json.dumps(body), headers = headers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "response.json()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
